INTERSPEECH.2017 - Language and Multimodal

Total: 136

#1 Multimodal Markers of Persuasive Speech: Designing a Virtual Debate Coach

Authors: Volha Petukhova ; Manoj Raju ; Harry Bunt

The study presented in this paper is carried out to support debate performance assessment in the context of debate skills training. The perception of good performance as a debater is influenced by how believable and convincing the debater’s argumentation is. We identified a number of features that are useful for explaining perceived properties of persuasive speech and for defining rules and strategies to produce and assess debate performance. We collected and analysed multimodal and multisensory data of the trainees’ debate behaviour and contrasted it with that of skilled professional debaters. Observational, correlation and machine learning studies were performed to identify multimodal markers of persuasive speech and link them to experts’ assessments. Based on our analysis, a combination of multimodal in- and out-of-domain debate data and various non-verbal, prosodic, lexical, linguistic and structural features was computed, and several classification procedures were applied, achieving an accuracy of 0.79 on spoken debate data.

#2 Acoustic-Prosodic and Physiological Response to Stressful Interactions in Children with Autism Spectrum Disorder

Authors: Daniel Bone ; Julia Mertens ; Emily Zane ; Sungbok Lee ; Shrikanth S. Narayanan ; Ruth Grossman

Social anxiety is a prevalent condition affecting individuals to varying degrees. Research on autism spectrum disorder (ASD), a group of neurodevelopmental disorders marked by impairments in social communication, has found that social anxiety occurs more frequently in this population. Our study aims to further understand the multimodal manifestation of social stress for adolescents with ASD versus typically developing (TD) peers. We investigate this through objective measures of speech behavior and physiology (mean heart rate) acquired during three tasks: a low-stress conversation, a medium-stress interview, and a high-stress presentation. Measurable differences are found to exist for speech behavior and heart rate in relation to task-induced stress. Additionally, we find the acoustic measures are particularly effective for distinguishing between diagnostic groups. Individuals with ASD produced higher prosodic variability, agreeing with previous reports. Moreover, the most informative features captured an individual’s vocal changes between low and high social stress, suggesting an interaction between vocal production and social stressors in ASD.

#3 A Stepwise Analysis of Aggregated Crowdsourced Labels Describing Multimodal Emotional Behaviors

Authors: Alec Burmania ; Carlos Busso

Affect recognition is a difficult problem that most often relies on human-annotated data to train automated systems. As humans perceive emotion differently based on personality, cognitive state and past experiences, it is important to collect rankings from multiple individuals to assess the emotional content in corpora; these rankings are later aggregated with rules such as majority vote. With the increased use of crowdsourcing services for perceptual evaluations, collecting large amounts of data is now feasible. It becomes important to question the amount of data needed to create well-trained classifiers. How different are the aggregated labels collected from five raters compared to the ones obtained from twenty evaluators? Is it worthwhile to spend resources to increase the number of evaluators beyond those used in conventional/laboratory studies? This study evaluates the consensus labels obtained by incrementally adding new evaluators during perceptual evaluations. Using majority vote over categorical emotional labels, we compare the changes in the aggregated labels starting with one rater and finishing with 20 raters. The large number of evaluators in a subset of the MSP-IMPROV database and the ability to filter annotators by quality allow us to better understand label aggregation as a function of the number of annotators.
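
The core operation described here, tracking how a majority-vote consensus label changes as evaluators are added one at a time, can be sketched in a few lines. This is an illustration only, not the authors' evaluation code; the example ratings are invented.

```python
# Minimal sketch: track the majority-vote label for one stimulus as raters are added.
from collections import Counter

def incremental_majority_vote(ratings):
    """ratings: categorical labels in the order raters were added.
    Returns the consensus label after each new rater (None on ties)."""
    consensus, counts = [], Counter()
    for label in ratings:
        counts[label] += 1
        ranked = counts.most_common(2) + [(None, 0)]
        (top, n_top), (_, n_second) = ranked[0], ranked[1]
        consensus.append(top if n_top > n_second else None)  # None marks a tie
    return consensus

# Hypothetical ratings from eight crowdworkers for one clip:
print(incremental_majority_vote(
    ["happy", "neutral", "happy", "happy", "sad", "happy", "neutral", "happy"]))
```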

#4 An Information Theoretic Analysis of the Temporal Synchrony Between Head Gestures and Prosodic Patterns in Spontaneous Speech

Authors: Gaurav Fotedar ; Prasanta Kumar Ghosh

We analyze the temporal co-ordination between head gestures and prosodic patterns in spontaneous speech in a data-driven manner. For this study, we consider head motion and speech data from 24 subjects while they tell a fixed set of five stories. The head motion, captured using a motion capture system, is converted to Euler angles and translations in the X, Y and Z directions to represent head gestures. Pitch and short-time energy in voiced segments are used to represent the prosodic patterns. To capture the statistical relationship between head gestures and prosodic patterns, mutual information (MI) is computed at various delays between the two using data from 24 subjects speaking six native languages. The estimated MI, averaged across all subjects, is found to be maximum when the head gestures lag the prosodic patterns by 30 msec. This is found to be true when subjects tell stories in English as well as in their native language. We observe a similar pattern in the root mean squared error of predicting head gestures from prosodic patterns using a Gaussian mixture model. These results indicate that there could be an asynchrony between head gestures and prosody during spontaneous speech, where head gestures follow the corresponding prosodic patterns.
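
A lag-dependent mutual information analysis of this kind can be sketched with histogram-discretized streams. This is not the authors' estimator; the bin count, frame rate and the synthetic signals below are assumptions for illustration.

```python
# Sketch: histogram-based MI between a prosodic stream (e.g. short-time energy) and one
# head-motion stream (e.g. a Euler angle), evaluated at a range of lags.
import numpy as np
from sklearn.metrics import mutual_info_score

def lagged_mi(prosody, head, max_lag, n_bins=16):
    """MI between two equal-length 1-D streams; positive lag = head motion delayed."""
    mi = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = prosody[:len(prosody) - lag], head[lag:]
        else:
            x, y = prosody[-lag:], head[:len(head) + lag]
        xd = np.digitize(x, np.histogram_bin_edges(x, bins=n_bins))
        yd = np.digitize(y, np.histogram_bin_edges(y, bins=n_bins))
        mi[lag] = mutual_info_score(xd, yd)
    return mi

rng = np.random.default_rng(0)
energy = rng.standard_normal(1000)                                   # stand-in prosodic stream
pitch_angle = np.roll(energy, 3) + 0.5 * rng.standard_normal(1000)   # head motion lagging by 3 frames
scores = lagged_mi(energy, pitch_angle, max_lag=10)
print(max(scores, key=scores.get))                                   # expected to peak near lag = +3
```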

#5 Multimodal Prediction of Affective Dimensions via Fusing Multiple Regression Techniques

Authors: D.-Y. Huang ; Wan Ding ; Mingyu Xu ; Huaiping Ming ; Minghui Dong ; Xinguo Yu ; Haizhou Li

This paper presents a multimodal approach to predicting affective dimensions that makes full use of features from audio, video, Electrodermal Activity (EDA) and Electrocardiogram (ECG) signals using three regression techniques: support vector regression (SVR), partial least squares regression (PLS), and deep bidirectional long short-term memory recurrent neural network (DBLSTM-RNN) regression. Each of the three regression techniques performs multimodal affective dimension prediction, followed by a fusion of the different models trained on the features of the four modalities using support vector regression. Support vector regression is also applied for a final fusion of the three regression systems. Experiments show that our proposed approach obtains promising results on the AVEC 2015 benchmark dataset for prediction of multimodal affective dimensions. On the development set, the concordance correlation coefficient (CCC) reaches 0.856 for arousal and 0.720 for valence, which improves on the AVEC 2015 top performer by 3.88% for arousal and 4.66% for valence.
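
The evaluation metric and the late-fusion step can be illustrated as follows. This is a sketch, not the AVEC submission itself: the per-model predictions are random stand-ins and the fusion SVR would normally be trained on held-out folds.

```python
# Concordance correlation coefficient (CCC) and a simple SVR late fusion over the
# outputs of several regressors.
import numpy as np
from sklearn.svm import SVR

def ccc(x, y):
    """Concordance correlation coefficient between prediction x and gold y."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return 2 * cov / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

# Hypothetical per-model arousal predictions on a development set:
gold = np.sin(np.linspace(0, 6, 200))
preds = {
    "svr":    gold + 0.3 * np.random.randn(200),
    "pls":    gold + 0.4 * np.random.randn(200),
    "dblstm": gold + 0.2 * np.random.randn(200),
}
stack = np.column_stack(list(preds.values()))
fusion = SVR(kernel="linear").fit(stack, gold)   # fusion model (trained on dev folds in practice)
print({k: round(ccc(v, gold), 3) for k, v in preds.items()},
      "fused:", round(ccc(fusion.predict(stack), gold), 3))
```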

#6 Co-Production of Speech and Pointing Gestures in Clear and Perturbed Interactive Tasks: Multimodal Designation Strategies

Authors: Marion Dohen ; Benjamin Roustan

Designation consists in attracting an interlocutor’s attention to a specific object and/or location. It is most often achieved using both speech (e.g., demonstratives) and gestures (e.g., manual pointing). This study aims at analyzing how speech and pointing gestures are co-produced in a semi-directed interactive task involving designation. Twenty native speakers of French were involved in a cooperative task in which they gave instructions to a partner so that she could reproduce, on a grid both of them saw, a model she could not see. They had to use only sentences of the form ‘The [target word] goes there.’ They did this in two conditions: silence and noise. Their speech and articulatory/hand movements (motion capture) were recorded. The analyses show that the participants’ speech features were modified in noise (Lombard effect). They also spoke more slowly and made more pauses and errors. Their pointing gestures lasted longer and started later, showing an adaptation of gesture production to speech. The condition did not influence speech/gesture coordination. The apex (the peak of the pointing gesture) mainly occurred at the same time as the target word, not the demonstrative, showing that speakers temporally group speech and gesture elements that carry complementary rather than redundant information.

#7 The Influence of Synthetic Voice on the Evaluation of a Virtual Character

Authors: João Paulo Cabral ; Benjamin R. Cowan ; Katja Zibrek ; Rachel McDonnell

Graphical realism and the naturalness of the voice used are important aspects to consider when designing a virtual agent or character. In this work, we evaluate how synthetic speech impacts people’s perceptions of a rendered virtual character. Using a controlled experiment, we focus on the role that speech, in particular voice expressiveness in the form of personality, has on the assessment of voice level and character level perceptions. We found that people rated a real human voice as more expressive, understandable and likeable than the expressive synthetic voice we developed. Contrary to our expectations, we found that the voices did not have a significant impact on the character level judgments; people in the voice conditions did not significantly vary on their ratings of appeal, credibility, human-likeness and voice matching the character. The implications this has for character design and how this compares with previous work are discussed.

#8 Articulatory Text-to-Speech Synthesis Using the Digital Waveguide Mesh Driven by a Deep Neural Network

Authors: Amelia J. Gully ; Takenori Yoshimura ; Damian T. Murphy ; Kei Hashimoto ; Yoshihiko Nankaku ; Keiichi Tokuda

Following recent advances in direct modeling of the speech waveform using a deep neural network, we propose a novel method that directly estimates a physical model of the vocal tract from the speech waveform, rather than magnetic resonance imaging data. This provides a clear relationship between the model and the size and shape of the vocal tract, offering considerable flexibility in terms of speech characteristics such as age and gender. Initial tests indicate that despite a highly simplified physical model, intelligible synthesized speech is obtained. This illustrates the potential of the combined technique for the control of physical models in general, and hence the generation of more natural-sounding synthetic speech.

#9 An HMM/DNN Comparison for Synchronized Text-to-Speech and Tongue Motion Synthesis

Authors: Sébastien Le Maguer ; Ingmar Steiner ; Alexander Hewer

We present an end-to-end text-to-speech (TTS) synthesis system that generates audio and synchronized tongue motion directly from text. This is achieved by adapting a statistical shape space model of the tongue surface to an articulatory speech corpus and training a speech synthesis system directly on the tongue model parameter weights. We focus our analysis on the application of two standard methodologies, based on Hidden Markov Models (HMMs) and Deep Neural Networks (DNNs), respectively, to train both acoustic models and the tongue model parameter weights. We evaluate both methodologies at every step by comparing the predicted articulatory movements against the reference data. The results show that even with less than 2h of data, DNNs already outperform HMMs.

#10 VCV Synthesis Using Task Dynamics to Animate a Factor-Based Articulatory Model

Authors: Rachel Alexander ; Tanner Sorensen ; Asterios Toutios ; Shrikanth S. Narayanan

This paper presents an initial architecture for articulatory synthesis which combines a dynamical system for the control of vocal tract shaping with a novel MATLAB implementation of an articulatory synthesizer. The dynamical system controls a speaker-specific vocal tract model derived by factor analysis of mid-sagittal real-time MRI data and provides input to the articulatory synthesizer, which simulates the propagation of sound waves in the vocal tract. First, parameters of the dynamical system are estimated from real-time MRI data of human speech production. Second, vocal-tract dynamics is simulated for vowel-consonant-vowel utterances using a sequence of two dynamical systems: the first one starts from a vowel vocal-tract configuration and achieves a vocal-tract closure; the second one starts from the closure and achieves the target configuration of the second vowel. Third, vocal-tract dynamics is converted to area function dynamics and is input to the synthesizer to generate the acoustic signal. Synthesized vowel-consonant-vowel examples demonstrate the feasibility of the method.
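
The control scheme described here, a sequence of dynamical systems driving the vocal tract from a vowel configuration to closure and on to a second vowel, can be illustrated with a critically damped point attractor. This is only a sketch of the general task-dynamic idea; the stiffness, durations and constriction values below are invented, not the parameters estimated in the paper.

```python
# Illustrative second-order point attractor driving one vocal-tract parameter
# through vowel -> closure -> vowel targets.
import numpy as np

def point_attractor(x0, v0, target, k, duration, dt=0.005):
    """x'' = -k (x - target) - 2*sqrt(k) x'  (critically damped), Euler-integrated."""
    x, v, traj = x0, v0, []
    for _ in range(int(duration / dt)):
        a = -k * (x - target) - 2 * np.sqrt(k) * v
        v += a * dt
        x += v * dt
        traj.append(x)
    return np.array(traj), x, v

# Hypothetical constriction degree: vowel /a/ -> closure (0) -> vowel /i/ value.
seg1, x, v = point_attractor(x0=1.2, v0=0.0, target=0.0, k=400.0, duration=0.15)
seg2, _, _ = point_attractor(x0=x,   v0=v,   target=0.8, k=400.0, duration=0.20)
trajectory = np.concatenate([seg1, seg2])
print(trajectory[::10].round(2))
```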

#11 Beyond the Listening Test: An Interactive Approach to TTS Evaluation

Authors: Joseph Mendelson ; Matthew P. Aylett

Traditionally, subjective text-to-speech (TTS) evaluation is performed through audio-only listening tests, where participants evaluate unrelated, context-free utterances. The ecological validity of these tests is questionable, as they do not represent real-world end-use scenarios. In this paper, we examine a novel approach to TTS evaluation in an imagined end-use, via a complex interaction with an avatar. Six different voice conditions were tested: Natural speech, Unit Selection and Parametric Synthesis, in neutral and expressive realizations. Results were compared to a traditional audio-only evaluation baseline. Participants in both studies rated the voices for naturalness and expressivity. The baseline study showed canonical results for naturalness: Natural speech scored highest, followed by Unit Selection, then Parametric synthesis. Expressivity was clearly distinguishable in all conditions. In the avatar interaction study, participants rated naturalness in the same order as the baseline, though with a smaller effect size; expressivity was not distinguishable. Further, no significant correlations were found between cognitive or affective responses and any voice condition. This highlights two primary challenges in designing more valid TTS evaluations: in real-world use-cases involving interaction, listeners generally interact with a single voice, making comparative analysis unfeasible, and in complex interactions, the context and content may confound perception of voice quality.

#12 Integrating Articulatory Information in Deep Learning-Based Text-to-Speech Synthesis

Authors: Beiming Cao ; Myungjong Kim ; Jan van Santen ; Ted Mau ; Jun Wang

Articulatory information has been shown to be effective in improving the performance of hidden Markov model (HMM)-based text-to-speech (TTS) synthesis. Recently, deep learning-based TTS has outperformed HMM-based approaches. However, articulatory information has rarely been integrated in deep learning-based TTS. This paper investigated the effectiveness of integrating articulatory movement data into deep learning-based TTS. The integration of articulatory information was achieved in two ways: (1) direct integration, where articulatory and acoustic features were both outputs of a deep neural network (DNN), and (2) direct integration plus forward-mapping, where the output articulatory features were mapped to acoustic features by an additional DNN; these forward-mapped acoustic features were then combined with the output acoustic features to produce the final acoustic features. Articulatory (tongue and lip) and acoustic data collected from male and female speakers were used in the experiment. Both objective measures and subjective judgment by human listeners showed that the approaches integrating articulatory information outperformed the baseline approach (without using articulatory information) in terms of naturalness and speaker voice identity (voice similarity).

#13 Approaches for Neural-Network Language Model Adaptation

Authors: Min Ma ; Michael Nirschl ; Fadi Biadsy ; Shankar Kumar

Language Models (LMs) for Automatic Speech Recognition (ASR) are typically trained on large text corpora from news articles, books and web documents. These types of corpora, however, are unlikely to match the test distribution of ASR systems, which expect spoken utterances. Therefore, the LM is typically adapted to a smaller held-out in-domain dataset that is drawn from the test distribution. We propose three LM adaptation approaches for deep Neural Network (NN) and Long Short-Term Memory (LSTM) LMs: (1) adapting the softmax layer in the NN; (2) adding a non-linear adaptation layer before the softmax layer that is trained only in the adaptation phase; (3) training the extra non-linear adaptation layer in both the pre-training and adaptation phases. Aiming to improve upon a hierarchical Maximum Entropy (MaxEnt) second-pass LM baseline, which factors the model into word-cluster and word models, we build an NN LM that predicts only word clusters. Adapting the LSTM LM by training the adaptation layer in both training and adaptation phases (approach 3), we reduce the cluster perplexity by 30% on a held-out dataset compared to an unadapted LSTM LM. Initial experiments using a state-of-the-art ASR system show a 2.3% relative reduction in WER on top of an adapted MaxEnt LM.
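
The adaptation-layer idea can be sketched in PyTorch. This is a generic stand-in, not the authors' production model: a non-linear layer is inserted before the softmax and, during adaptation, only that layer is updated (the sizes and optimizer below are assumptions).

```python
# Sketch of approaches (2)/(3): an LSTM LM with an extra adaptation layer before the softmax.
import torch
import torch.nn as nn

class AdaptableLSTMLM(nn.Module):
    def __init__(self, vocab_size, emb=256, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.adapt = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh())  # extra adaptation layer
        self.out = nn.Linear(hidden, vocab_size)                          # softmax layer (logits)

    def forward(self, tokens):                     # tokens: (B, T) word ids
        h, _ = self.lstm(self.embed(tokens))
        return self.out(self.adapt(h))             # (B, T, vocab) logits

model = AdaptableLSTMLM(vocab_size=10000)
# ... pre-train on the large out-of-domain corpus (approach 3 would train the adaptation
# layer here as well; approach 2 leaves it untrained until adaptation) ...

# Adaptation phase: freeze everything except the adaptation layer, then train on in-domain data.
for p in model.parameters():
    p.requires_grad = False
for p in model.adapt.parameters():
    p.requires_grad = True
optimizer = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=0.1)
```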

#14 A Batch Noise Contrastive Estimation Approach for Training Large Vocabulary Language Models

Authors: Youssef Oualil ; Dietrich Klakow

Training large vocabulary Neural Network Language Models (NNLMs) is a difficult task due to the explicit requirement of the output layer normalization, which typically involves the evaluation of the full softmax function over the complete vocabulary. This paper proposes a Batch Noise Contrastive Estimation (B-NCE) approach to alleviate this problem. This is achieved by reducing the vocabulary, at each time step, to the target words in the batch and then replacing the softmax by the noise contrastive estimation approach, where these words play the role of targets and noise samples at the same time. In doing so, the proposed approach can be fully formulated and implemented using optimal dense matrix operations. Applying B-NCE to train different NNLMs on the Large Text Compression Benchmark (LTCB) and the One Billion Word Benchmark (OBWB) shows a significant reduction of the training time with no noticeable degradation of the models’ performance. This paper also presents a new baseline comparative study of different standard NNLMs on the large OBWB on a single Titan-X GPU.
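
The core trick, restricting the output layer to the unique target words of the batch so that they serve both as positives and as each other's noise samples, can be sketched in PyTorch. This simplifies the objective (the noise-distribution correction term of full NCE is omitted) and is not the paper's exact formulation.

```python
# Sketch of the batch-NCE idea: logits are computed only over the batch vocabulary,
# so everything stays a dense matmul.
import torch
import torch.nn.functional as F

def batch_nce_loss(hidden, targets, out_weight, out_bias):
    """hidden: (B, H) final hidden states; targets: (B,) word ids;
    out_weight: (V, H) output embedding matrix; out_bias: (V,)."""
    batch_vocab, pos = torch.unique(targets, return_inverse=True)         # reduced vocabulary
    logits = hidden @ out_weight[batch_vocab].T + out_bias[batch_vocab]   # (B, |batch_vocab|)
    labels = torch.zeros_like(logits)
    labels[torch.arange(len(targets)), pos] = 1.0                         # own word = positive
    # Simplified binary NCE-style objective over the batch vocabulary:
    return F.binary_cross_entropy_with_logits(logits, labels)

B, H, V = 32, 128, 50000
loss = batch_nce_loss(torch.randn(B, H), torch.randint(0, V, (B,)),
                      torch.randn(V, H), torch.zeros(V))
print(loss.item())
```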

#15 Investigating Bidirectional Recurrent Neural Network Language Models for Speech Recognition

Authors: X. Chen ; A. Ragni ; X. Liu ; Mark J.F. Gales

Recurrent neural network language models (RNNLMs) are powerful language modeling techniques. Significant performance improvements have been reported in a range of tasks, including speech recognition, compared to n-gram language models. Conventional n-gram and neural network language models are trained to predict the probability of the next word given its preceding context history. In contrast, bidirectional recurrent neural network based language models consider the context from future words as well. This complicates the inference process, but has theoretical benefits for tasks such as speech recognition, as additional context information can be used. However, to date, very limited or no gains in speech recognition performance have been reported with this form of model. This paper examines the issues of training bidirectional recurrent neural network language models (bi-RNNLMs) for speech recognition. A bi-RNNLM probability smoothing technique is proposed that addresses the very sharp posteriors often observed in these models. The performance of the bi-RNNLMs is evaluated on three speech recognition tasks: broadcast news; meeting transcription (AMI); and low-resource systems (Babel data). On all tasks, gains are observed by applying the smoothing technique to the bi-RNNLM. In addition, consistent performance gains can be obtained by combining bi-RNNLMs with n-gram and uni-directional RNNLMs.
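
The abstract does not spell out the smoothing scheme, so the snippet below shows only one generic way to flatten overly sharp posteriors (exponent smoothing with renormalisation); the paper's actual technique may differ.

```python
# Generic posterior smoothing: raise probabilities to a power gamma < 1 and renormalise.
import numpy as np

def smooth_posteriors(probs, gamma=0.7):
    probs = np.asarray(probs, float) ** gamma
    return probs / probs.sum()

sharp = np.array([0.90, 0.05, 0.03, 0.02])    # an illustrative, very peaky posterior
print(smooth_posteriors(sharp).round(3))      # probability mass moves toward competitors
```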

#16 Fast Neural Network Language Model Lookups at N-Gram Speeds

Authors: Yinghui Huang ; Abhinav Sethy ; Bhuvana Ramabhadran

Feed-forward Neural Network Language Models (NNLMs) have shown consistent gains over backoff word n-gram models in a variety of tasks. However, backoff n-gram models still remain dominant in applications with real-time decoding requirements, as word probabilities can be computed orders of magnitude faster than with the NNLM. In this paper, we present a combination of techniques that allows us to speed up the probability computation from a neural net language model to make it comparable to the word n-gram model without any approximations. We present results on state-of-the-art systems for broadcast news transcription and conversational speech which demonstrate the speed improvements in real-time factor and probability computation while retaining the WER gains from the NNLM.

#17 Empirical Exploration of Novel Architectures and Objectives for Language Models

Authors: Gakuto Kurata ; Abhinav Sethy ; Bhuvana Ramabhadran ; George Saon

While recurrent neural network language models based on Long Short Term Memory (LSTM) have shown good gains in many automatic speech recognition tasks, Convolutional Neural Network (CNN) language models are relatively new and have not been studied in-depth. In this paper we present an empirical comparison of LSTM and CNN language models on English broadcast news and various conversational telephone speech transcription tasks. We also present a new type of CNN language model that leverages dilated causal convolution to efficiently exploit long range history. We propose a novel criterion for training language models that combines word and class prediction in a multi-task learning framework. We apply this criterion to train word and character based LSTM language models and CNN language models and show that it improves performance. Our results also show that CNN and LSTM language models are complementary and can be combined to obtain further gains.
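
The dilated causal convolution idea can be sketched as follows in PyTorch. Layer sizes and the dilation schedule are illustrative guesses, not the configuration used in the paper, and the multi-task word/class criterion is omitted.

```python
# Illustrative dilated causal convolution stack over word embeddings for next-word prediction.
import torch
import torch.nn as nn

class DilatedCausalConvLM(nn.Module):
    def __init__(self, vocab_size, emb=256, channels=256, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.convs = nn.ModuleList(
            nn.Conv1d(emb if i == 0 else channels, channels, kernel_size=2, dilation=d)
            for i, d in enumerate(dilations))
        self.dilations = dilations
        self.out = nn.Linear(channels, vocab_size)

    def forward(self, tokens):                      # tokens: (B, T)
        x = self.embed(tokens).transpose(1, 2)      # (B, emb, T)
        for conv, d in zip(self.convs, self.dilations):
            x = torch.relu(conv(nn.functional.pad(x, (d, 0))))  # left-pad => causal
        return self.out(x.transpose(1, 2))          # (B, T, vocab) next-word logits

logits = DilatedCausalConvLM(vocab_size=10000)(torch.randint(0, 10000, (4, 20)))
print(logits.shape)  # torch.Size([4, 20, 10000])
```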

#18 Residual Memory Networks in Language Modeling: Improving the Reputation of Feed-Forward Networks

Authors: Karel Beneš ; Murali Karthick Baskar ; Lukáš Burget

We introduce the Residual Memory Network (RMN) architecture to language modeling. RMN is a feed-forward neural network architecture that incorporates residual connections and time-delay connections, allowing it to naturally incorporate information from a substantial time context. As this is the first time RMNs are applied to language modeling, we thoroughly investigate their behaviour on the well-studied Penn Treebank corpus. We change the model slightly for the needs of language modeling, reducing both its time and memory consumption. Our results show that the RMN is a suitable choice for small-sized neural language models: with a test perplexity of 112.7 and as few as 2.3M parameters, it outperforms both a much larger vanilla RNN (PPL 124, 8M parameters) and a similarly sized LSTM (PPL 115, 2.08M parameters), while being less than 3 perplexity points worse than an LSTM twice its size.
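
A hedged PyTorch sketch of one building block in the spirit of the description above: a feed-forward layer that combines the current frame with a time-delayed frame and adds a residual connection. The exact wiring and dimensions in the paper may differ.

```python
# One feed-forward block with a residual connection and a time-delay connection.
import torch
import torch.nn as nn

class ResidualMemoryBlock(nn.Module):
    def __init__(self, dim, delay):
        super().__init__()
        self.delay = delay
        self.linear = nn.Linear(2 * dim, dim)      # current frame + frame from `delay` steps back

    def forward(self, x):                          # x: (B, T, dim)
        pad = x.new_zeros(x.size(0), self.delay, x.size(2))
        delayed = torch.cat([pad, x[:, :x.size(1) - self.delay]], dim=1)
        h = torch.relu(self.linear(torch.cat([x, delayed], dim=-1)))
        return x + h                               # residual connection

block = ResidualMemoryBlock(dim=64, delay=3)
print(block(torch.randn(2, 20, 64)).shape)         # torch.Size([2, 20, 64])
```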

#19 Dominant Distortion Classification for Pre-Processing of Vowels in Remote Biomedical Voice Analysis

Authors: Amir Hossein Poorjam ; Jesper Rindom Jensen ; Max A. Little ; Mads Græsbøll Christensen

Advances in speech signal analysis facilitate the development of techniques for remote biomedical voice assessment. However, the performance of these techniques is affected by noise and distortion in signals. In this paper, we focus on the vowel /a/ as the most widely-used voice signal for pathological voice assessments and investigate the impact of four major types of distortion that are commonly present during recording or transmission in voice analysis, namely: background noise, reverberation, clipping and compression, on Mel-frequency cepstral coefficients (MFCCs) — the most widely-used features in biomedical voice analysis. Then, we propose a new distortion classification approach to detect the most dominant distortion in such voice signals. The proposed method involves MFCCs as frame-level features and a support vector machine as classifier to detect the presence and type of distortion in frames of a given voice signal. Experimental results obtained from the healthy and Parkinson’s voices show the effectiveness of the proposed approach in distortion detection and classification.
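
The frame-level pipeline described above (MFCC features plus an SVM classifier) can be sketched with librosa and scikit-learn. The file paths, feature settings and class list below are placeholders, not the authors' configuration.

```python
# Sketch: frame-level MFCCs + SVM for dominant-distortion classification of /a/ recordings.
import numpy as np
import librosa
from sklearn.svm import SVC

def frame_mfccs(path, sr=16000, n_mfcc=13):
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T   # (frames, n_mfcc)

# Placeholder file lists: recordings of sustained /a/ with a known dominant distortion.
train_files = {"noise": ["a_noise.wav"], "reverberation": ["a_reverb.wav"],
               "clipping": ["a_clipped.wav"], "compression": ["a_codec.wav"]}

feats, labels = [], []
for label, files in train_files.items():
    for path in files:
        m = frame_mfccs(path)
        feats.append(m)
        labels.extend([label] * len(m))

clf = SVC(kernel="rbf").fit(np.vstack(feats), labels)
# At test time: classify every frame of an utterance, then report the most frequent
# frame label as the dominant distortion of that utterance.
```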

#20 Automatic Paraphasia Detection from Aphasic Speech: A Preliminary Study

Authors: Duc Le ; Keli Licata ; Emily Mower Provost

Aphasia is an acquired language disorder resulting from brain damage that can cause significant communication difficulties. Aphasic speech is often characterized by errors known as paraphasias, the analysis of which can be used to determine an appropriate course of treatment and to track an individual’s recovery progress. Being able to detect paraphasias automatically has many potential clinical benefits; however, this problem has not previously been investigated in the literature. In this paper, we perform the first study on detecting phonemic and neologistic paraphasias from scripted speech samples in AphasiaBank. We propose a speech recognition system with task-specific language models to transcribe aphasic speech automatically. We investigate features based on speech duration, Goodness of Pronunciation, phone edit distance, and Dynamic Time Warping on phoneme posteriorgrams. Our results demonstrate the feasibility of automatic paraphasia detection and outline the path toward enabling this system in real-world clinical applications.
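
One of the listed features, phone edit distance between the scripted target and the decoded production, reduces to a Levenshtein distance over phone sequences. The sketch below is illustrative only; the phone sequences are invented.

```python
# Normalised phone edit distance between a target phone sequence and a decoded one.
def phone_edit_distance(ref, hyp):
    """Levenshtein distance between two phone sequences, normalised by len(ref)."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(hyp) + 1)]
         for i in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                               # deletion
                          d[i][j - 1] + 1,                               # insertion
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))  # substitution
    return d[-1][-1] / max(len(ref), 1)

# Target word "cat" /k ae t/ produced as the phonemic paraphasia "gat" /g ae t/:
print(phone_edit_distance(["k", "ae", "t"], ["g", "ae", "t"]))  # 0.33...
```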

#21 Evaluation of the Neurological State of People with Parkinson’s Disease Using i-Vectors

Authors: N. Garcia ; Juan Rafael Orozco-Arroyave ; L.F. D’Haro ; Najim Dehak ; Elmar Nöth

The i-vector approach is used to model the speech of patients with Parkinson’s disease (PD) with the aim of assessing their condition. Features related to the articulation, phonation, and prosody dimensions of speech were used to train different i-vector extractors. Each i-vector extractor is trained using utterances from both PD patients and healthy controls. The i-vectors of the healthy control (HC) speakers are averaged to form a single i-vector that represents the HC group, i.e., the reference i-vector. A similar process is done to create a reference for the group of PD patients. Then, the i-vectors of test speakers are compared to these reference i-vectors using the cosine distance. Three analyses are performed using this distance: classification between PD patients and HC, prediction of the neurological state of PD patients according to the MDS-UPDRS-III scale, and prediction of a modified version of the Frenchay Dysarthria Assessment. The Spearman’s correlation between this cosine distance and the MDS-UPDRS-III scale was 0.63. These results show the suitability of this approach to monitor the neurological state of people with Parkinson’s disease.
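
The scoring step, averaging each group's i-vectors into a reference and comparing a test speaker by cosine distance, is simple to sketch. The i-vectors below are random stand-ins for the real articulation/phonation/prosody extractors.

```python
# Sketch: group reference i-vectors and cosine-distance scoring of a test speaker.
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(1)
hc_ivectors = rng.standard_normal((30, 400))    # healthy-control training i-vectors (stand-ins)
pd_ivectors = rng.standard_normal((25, 400))    # PD-patient training i-vectors (stand-ins)
ref_hc, ref_pd = hc_ivectors.mean(axis=0), pd_ivectors.mean(axis=0)

test_ivector = rng.standard_normal(400)
d_hc = cosine_distance(test_ivector, ref_hc)    # this distance is what gets correlated
d_pd = cosine_distance(test_ivector, ref_pd)    # with clinical scales such as MDS-UPDRS-III
print("closer to PD reference" if d_pd < d_hc else "closer to HC reference")
```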

#22 Objective Severity Assessment from Disordered Voice Using Estimated Glottal Airflow

Authors: Yu-Ren Chien ; Michal Borský ; Jón Guðnason

In clinical practice, the severity of disordered voice is typically rated by a professional with auditory-perceptual judgment. The present study aims to automate this assessment procedure, in an attempt to make the assessment objective and less labor-intensive. In the automated analysis, glottal airflow is estimated from the analyzed voice signal with an inverse filtering algorithm. Automatic assessment is realized by a regressor that predicts from temporal and spectral features of the glottal airflow. A regressor trained on overtone amplitudes and harmonic richness factors extracted from a set of continuous-speech utterances was applied to a set of sustained-vowel utterances, giving severity predictions (on a scale of ratings from 0 to 100) with an average error magnitude of 14.

#23 Earlier Identification of Children with Autism Spectrum Disorder: An Automatic Vocalisation-Based Approach

Authors: Florian B. Pokorny ; Björn Schuller ; Peter B. Marschik ; Raymond Brueckner ; Pär Nyström ; Nicholas Cummins ; Sven Bölte ; Christa Einspieler ; Terje Falck-Ytter

Autism spectrum disorder (ASD) is a neurodevelopmental disorder usually diagnosed in or beyond toddlerhood. ASD is defined by repetitive and restricted behaviours, and deficits in social communication. The early speech-language development of individuals with ASD has been characterised as delayed. However, little is known about ASD-related characteristics of pre-linguistic vocalisations at the feature level. In this study, we examined pre-linguistic vocalisations of 10-month-old individuals later diagnosed with ASD and a matched control group of typically developing individuals (N = 20). We segmented 684 vocalisations from parent-child interaction recordings. All vocalisations were annotated and signal-analytically decomposed. We analysed ASD-related vocalisation specificities on the basis of a standardised set (eGeMAPS) of 88 acoustic features selected for clinical speech analysis applications. Of these, 54 features showed evidence for a differentiation between vocalisations of individuals later diagnosed with ASD and controls. In addition, we evaluated the feasibility of automated, vocalisation-based identification of individuals later diagnosed with ASD. We compared linear kernel support vector machines and a 1-layer bidirectional long short-term memory neural network. Both classification approaches achieved an accuracy of 75% for subject-wise identification in a subject-independent 3-fold cross-validation scheme. Our promising results may be an important contribution en route to facilitating earlier identification of ASD.
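
A subject-independent cross-validation with a linear SVM, as described above, can be sketched with scikit-learn. The feature matrix below is a random stand-in for the openSMILE eGeMAPS functionals, and the labels and fold setup are placeholders.

```python
# Sketch: eGeMAPS-style features + linear SVM with subject-independent 3-fold CV.
import numpy as np
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.standard_normal((684, 88))               # one row per segmented vocalisation (stand-in)
subject = rng.integers(0, 20, size=684)          # 20 infants
y = subject % 2                                  # placeholder ASD/TD label per subject

clf = make_pipeline(StandardScaler(), LinearSVC(C=0.1, max_iter=5000))
scores = cross_val_score(clf, X, y, groups=subject, cv=GroupKFold(n_splits=3))
print(scores.mean())   # vocalisation-level accuracy; subject-wise voting would follow
```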

#24 Convolutional Neural Network to Model Articulation Impairments in Patients with Parkinson’s Disease

Authors: J.C. Vásquez-Correa ; Juan Rafael Orozco-Arroyave ; Elmar Nöth

Speech impairments are one of the earliest manifestations in patients with Parkinson’s disease. Particularly, articulation deficits related to the capability of the speaker to start/stop the vibration of the vocal folds have been observed in the patients. Those difficulties can be assessed by modeling the transitions between voiced and unvoiced segments from speech. A robust strategy to model the articulatory deficits related to the starting or stopping vibration of the vocal folds is proposed in this study. The transitions between voiced and unvoiced segments are modeled by a convolutional neural network that extracts suitable information from two time-frequency representations: the short time Fourier transform and the continuous wavelet transform. The proposed approach improves the results previously reported in the literature. Accuracies of up to 89% are obtained for the classification of Parkinson’s patients vs. healthy speakers. This study is a step towards the robust modeling of the speech impairments in patients with neuro-degenerative disorders.

#25 Rescoring-Aware Beam Search for Reduced Search Errors in Contextual Automatic Speech Recognition

Authors: Ian Williams ; Petar Aleksic

Using context in automatic speech recognition allows the recognition system to dynamically task-adapt and bring gains to a broad variety of use-cases. An important mechanism of context-inclusion is on-the-fly rescoring of hypotheses with contextual language model content available only in real-time. In systems where rescoring occurs on the lattice during its construction as part of beam search decoding, hypotheses eligible for rescoring may be missed due to pruning. This can happen for many reasons: the language model and rescoring model may assign significantly different scores, there may be a lot of noise in the utterance, or word prefixes with a high out-degree may necessitate aggressive pruning to keep the search tractable. This results in misrecognitions when contextually-relevant hypotheses are pruned before rescoring, even if a contextual rescoring model favors those hypotheses by a large margin. We present a technique to adapt the beam search algorithm to preserve hypotheses when they may benefit from rescoring. We show that this technique significantly reduces the number of search pruning errors on rescorable hypotheses, without a significant increase in the search space size. This technique makes it feasible to use one base language model, but still achieve high-accuracy speech recognition results in all contexts.
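
The central idea, protecting hypotheses that a contextual rescoring model could still boost so they are not pruned before rescoring, can be illustrated with a toy pruning step. This is not the production decoder; the contact phrases and scores are invented.

```python
# Toy sketch of rescoring-aware pruning: rescorable hypotheses survive the beam cut.
import heapq

def prune_beam(hypotheses, beam_size, is_rescorable):
    """hypotheses: list of (score, word_sequence); is_rescorable: predicate on a sequence."""
    protected = [h for h in hypotheses if is_rescorable(h[1])]
    regular = heapq.nlargest(beam_size, hypotheses, key=lambda h: h[0])
    # Union keeps the normal beam plus any rescorable hypotheses that would have been pruned.
    survivors = {tuple(seq): (score, seq) for score, seq in regular + protected}
    return list(survivors.values())

contacts = {("call", "anna"), ("call", "annakin")}   # hypothetical contextual phrases
beam = [(-1.2, ["call", "ana"]), (-1.5, ["call", "anna"]), (-3.9, ["call", "annakin"])]
print(prune_beam(beam, beam_size=2, is_rescorable=lambda seq: tuple(seq) in contacts))
```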